This kernel aims to accomplish several goals, to name a few:
Learn, review, and explain complex data science topics through write-ups; perform a comprehensive data analysis along with visualizations; and build models that are well equipped to predict house sale prices.
Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition's dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.
With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.
As the name suggests, this kernel goes on a detailed analysis journey of most of the regression algorithms. In addition to that, this kernel uses many charts and images to make things easier for readers to understand.
Define the Problem:
Gather the Data:
Prepare Data for Consumption:
Perform Exploratory Analysis:
Model Data:
Validate and Implement Data Model:
Optimize and Strategize:
from IPython.display import Image
Image("/Users/tuktuk/Downloads/1*DjIccrMeRWmrC_mCUOGDhw.png") # Image 1
In this problem we have to predict house prices on the basis of independent variables. With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.
For this project, the problem statement is handed to us on a silver platter: develop an algorithm to predict house prices.
Data is downloaded from Kaggle: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data. It consists of train, test and sample-submission datasets in CSV format.
Since step 2 was provided to us, the usual data-wrangling processes such as data architecture, governance, and extraction are out of scope; only data cleaning is in scope.
We will use the popular scikit-learn library to develop our machine learning algorithms. In sklearn, algorithms are called Estimators and implemented in their own classes. For data visualization, we will use the matplotlib and seaborn library. Below are common classes to load.
# NumPy
import numpy as np # linear algebra
# Dataframe operations
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)
#Visualization
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import seaborn as sns
import matplotlib.style as style
style.use('fivethirtyeight')
%matplotlib inline
# Feature Scaling
from sklearn.feature_selection import VarianceThreshold
from statsmodels.stats.outliers_influence import variance_inflation_factor
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from mlxtend.feature_selection import ExhaustiveFeatureSelector as EFS
# Scalers
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.utils import shuffle
from datetime import datetime
from scipy.stats import skew # for some statistics
from scipy.special import boxcox1p
from scipy.stats import boxcox_normmax
import sklearn.linear_model as linear_model
from sklearn.linear_model import ElasticNetCV, LassoCV, RidgeCV
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.svm import SVR
from mlxtend.regressor import StackingCVRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.model_selection import KFold, cross_val_score
from sklearn.metrics import mean_squared_error
import scipy.stats as stats
import matplotlib.style as style
import seaborn as sns
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import missingno as msno
import os
print(os.listdir("/Users/tuktuk/Downloads/house-prices-advanced-regression-techniques"))
# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory
import warnings
warnings.filterwarnings('ignore')
## Import Training data.
train = pd.read_csv("/Users/tuktuk/Downloads/house-prices-advanced-regression-techniques/train.csv")
train.head()
test = pd.read_csv("/Users/tuktuk/Downloads/house-prices-advanced-regression-techniques/test.csv")
test.head()
# making submission dataframe with test Id
submission = pd.DataFrame()
submission['Id'] = test['Id']
submission.head(5)
train.drop('Id', axis=1, inplace = True)
test.drop('Id', axis=1, inplace = True)
data = (train, test) # to perform actions on both datasets
# To perform actions on both sets together ( for data imputation and encoding)
train_test=pd.concat([train,test],axis=0,sort=False)
train_test.head()
print(train.shape, test.shape, train_test.shape)
print (f"Train has {train.shape[0]} rows and {train.shape[1]} columns")
print (f"Test has {test.shape[0]} rows and {test.shape[1]} columns")
print (f"Train_Test has {train_test.shape[0]} rows and {train_test.shape[1]} columns")
Data fields
Here's a brief version of what you'll find in the data description file.
SalePrice - the property's sale price in dollars. This is the target variable that we are trying to predict.
MSSubClass: The building class
MSZoning: The general zoning classification
LotFrontage: Linear feet of street connected to property
LotArea: Lot size in square feet
Street: Type of road access
Alley: Type of alley access
LotShape: General shape of property
LandContour: Flatness of the property
Utilities: Type of utilities available
LotConfig: Lot configuration
LandSlope: Slope of property
Neighborhood: Physical locations within Ames city limits
Condition1: Proximity to main road or railroad
Condition2: Proximity to main road or railroad (if a second is present)
BldgType: Type of dwelling
HouseStyle: Style of dwelling
OverallQual: Overall material and finish quality
OverallCond: Overall condition rating
YearBuilt: Original construction date
YearRemodAdd: Remodel date
RoofStyle: Type of roof
RoofMatl: Roof material
Exterior1st: Exterior covering on house
Exterior2nd: Exterior covering on house (if more than one material)
MasVnrType: Masonry veneer type
MasVnrArea: Masonry veneer area in square feet
ExterQual: Exterior material quality
ExterCond: Present condition of the material on the exterior
Foundation: Type of foundation
BsmtQual: Height of the basement
BsmtCond: General condition of the basement
BsmtExposure: Walkout or garden level basement walls
BsmtFinType1: Quality of basement finished area
BsmtFinSF1: Type 1 finished square feet
BsmtFinType2: Quality of second finished area (if present)
BsmtFinSF2: Type 2 finished square feet
BsmtUnfSF: Unfinished square feet of basement area
TotalBsmtSF: Total square feet of basement area
Heating: Type of heating
HeatingQC: Heating quality and condition
CentralAir: Central air conditioning
Electrical: Electrical system
1stFlrSF: First Floor square feet
2ndFlrSF: Second floor square feet
LowQualFinSF: Low quality finished square feet (all floors)
GrLivArea: Above grade (ground) living area square feet
BsmtFullBath: Basement full bathrooms
BsmtHalfBath: Basement half bathrooms
FullBath: Full bathrooms above grade
HalfBath: Half baths above grade
Bedroom: Number of bedrooms above basement level
Kitchen: Number of kitchens
KitchenQual: Kitchen quality
TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
Functional: Home functionality rating
Fireplaces: Number of fireplaces
FireplaceQu: Fireplace quality
GarageType: Garage location
GarageYrBlt: Year garage was built
GarageFinish: Interior finish of the garage
GarageCars: Size of garage in car capacity
GarageArea: Size of garage in square feet
GarageQual: Garage quality
GarageCond: Garage condition
PavedDrive: Paved driveway
WoodDeckSF: Wood deck area in square feet
OpenPorchSF: Open porch area in square feet
EnclosedPorch: Enclosed porch area in square feet
3SsnPorch: Three season porch area in square feet
ScreenPorch: Screen porch area in square feet
PoolArea: Pool area in square feet
PoolQC: Pool quality
Fence: Fence quality
MiscFeature: Miscellaneous feature not covered in other categories
MiscVal: $Value of miscellaneous feature
MoSold: Month Sold
YrSold: Year Sold
SaleType: Type of sale
SaleCondition: Condition of sale
train.info() # to get feel of the data( number,names, type, missing values)
# count of different types of objects.
(train.dtypes).value_counts()
# count of different types of objects.
(test.dtypes).value_counts()
# count of different types of objects.
(train_test.dtypes).value_counts()
test.info()
train.describe() # statistical feel (min, max, mean, std, 25% percentile, 50% percentile, 75% percentile)
# Variable wise ( by transposing above)
train.describe().T
train.describe(include = 'O') # For categorical columns
train.describe(include = 'O').T
train_test.describe(include = 'O').T
# list of numerical variables
num_vars = [var for var in train.columns if train[var].dtypes != 'O' and var not in ['Id','SalePrice' ]]
print('Number of numerical variables: ', num_vars, len(num_vars))
#list of discrete variables
discrete_vars = [var for var in num_vars if len(train[var].unique())<20 and var not in ['Id','SalePrice']]
print('Number of discrete variables: ', discrete_vars, len(discrete_vars))
# list of continuous variables
cont_vars = [var for var in num_vars if var not in discrete_vars+['Id','SalePrice']]
print('Number of continuous variables: ', cont_vars, len(cont_vars))
### Categorical variables
cat_vars = [var for var in train.columns if train[var].dtypes=='O']
print('Number of categorical variables: ', cat_vars, len(cat_vars))
# Checking cardinality
for var in cat_vars:
    print(var, "---------", len(train[var].unique()), ' categories')
# Let's check ranges and unique values
# discrete data unique values
for dataset in data:
    for var in discrete_vars:
        print(var, dataset[var].unique())
# continuous data range
for dataset in data:
    for var in cont_vars:
        print(var, dataset[var].min(), '-', dataset[var].max())
# categorical data unique values
for dataset in data:
    for var in cat_vars:
        print(var, "------", dataset[var].unique(), '\n')
def missing_percentage(df):
    """Takes a DataFrame (df) and returns two columns: the total missing
    values and the missing-value percentage, for every column that has any."""
    total = df.isnull().sum().sort_values(ascending=False)
    total = total[total != 0]
    percent = round(total / len(df) * 100, 2)
    return pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_percentage(train)
# Visualising NaN values
plt.figure(figsize = (12,12))
sns.heatmap(train.isnull(), yticklabels=False, cbar=False, cmap='viridis')
missing_percentage(test)
# Visualising NaN values
plt.figure(figsize = (12,12))
sns.heatmap(test.isnull(), yticklabels=False, cbar=False, cmap='viridis')
missing_percentage(train_test)
# Visualising NaN values
plt.figure(figsize = (12,12))
sns.heatmap(train_test.isnull(), yticklabels=False, cbar=False, cmap='viridis')
There are multiple types of features.
Some features have missing values.
Most of the features are of object dtype (i.e. they hold string values).
I want to focus on the target variable which is SalePrice. Let's create a histogram to see if the target variable is Normally distributed. If we want to create any linear model, it is essential that the features are normally distributed. This is one of the assumptions of multiple linear regression. I will explain more on this later.
In order to understand our data, we can look at each variable and try to understand their meaning and relevance to this problem. I know this is time-consuming, but it will give us the flavour of our dataset.
In order to have some discipline in our analysis, we can create an Excel spreadsheet with the following columns:
Variable - Variable name.
Type - Identification of the variables' type. There are two possible values for this field: 'numerical' or 'categorical'. By 'numerical' we mean variables for which the values are numbers, and by 'categorical' we mean variables for which the values are categories.
Segment - Identification of the variables' segment. We can define three possible segments: building, space or location. When we say 'building', we mean a variable that relates to the physical characteristics of the building (e.g. 'OverallQual'). When we say 'space', we mean a variable that reports space properties of the house (e.g. 'TotalBsmtSF'). Finally, when we say a 'location', we mean a variable that gives information about the place where the house is located (e.g. 'Neighborhood').
Expectation - Our expectation about the variable influence in 'SalePrice'. We can use a categorical scale with 'High', 'Medium' and 'Low' as possible values.
Conclusion - Our conclusions about the importance of the variable, after we give a quick look at the data. We can keep with the same categorical scale as in 'Expectation'.
Comments - Any general comments that occurred to us.
While 'Type' and 'Segment' are just for possible future reference, the column 'Expectation' is important because it will help us develop a 'sixth sense'. To fill this column, we should read the descriptions of all the variables and, one by one, ask ourselves:
Do we think about this variable when we are buying a house? (e.g. When we think about the house of our dreams, do we care about its 'Masonry veneer type'?). If so, how important would this variable be? (e.g. What is the impact of having 'Excellent' material on the exterior instead of 'Poor'? And of having 'Excellent' instead of 'Good'?). Is this information already described in any other variable? (e.g. If 'LandContour' gives the flatness of the property, do we really need to know the 'LandSlope'?). After this daunting exercise, we can filter the spreadsheet and look carefully to the variables with 'High' 'Expectation'. Then, we can rush into some scatter plots between those variables and 'SalePrice', filling in the 'Conclusion' column which is just the correction of our expectations.
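For discipline's sake, the audit spreadsheet can equally be kept as a small DataFrame; the rows below are illustrative placeholders, not conclusions from the data:

```python
import pandas as pd

# Illustrative audit table: the entries are example judgments, not findings.
audit = pd.DataFrame(
    [
        ["OverallQual", "categorical", "building", "High", "High", ""],
        ["TotalBsmtSF", "numerical", "space", "High", "High", ""],
        ["Neighborhood", "categorical", "location", "High", "High", "scatter plots may undersell it"],
    ],
    columns=["Variable", "Type", "Segment", "Expectation", "Conclusion", "Comments"],
)

# Filter to the variables we expect to matter most, as described above.
high_expectation = audit[audit["Expectation"] == "High"]["Variable"].tolist()
print(high_expectation)
```

From here, the 'Conclusion' column can be corrected row by row as the scatter plots come in.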
I went through this process and concluded that the following variables can play an important role in this problem:
OverallQual (which is a variable that I don't like because I don't know how it was computed; a funny exercise would be to predict 'OverallQual' using all the other variables available).
YearBuilt.
TotalBsmtSF.
GrLivArea.
I ended up with two 'building' variables ('OverallQual' and 'YearBuilt') and two 'space' variables ('TotalBsmtSF' and 'GrLivArea'). This might be a little unexpected, as it goes against the real estate mantra that all that matters is 'location, location and location'. It is possible that this quick data examination process was a bit harsh on categorical variables. For example, I expected the 'Neighborhood' variable to be more relevant, but after the data examination I ended up excluding it. Maybe this is related to the use of scatter plots instead of boxplots, which are more suitable for visualizing categorical variables. The way we visualize data often influences our conclusions.
However, the main point of this exercise was to think a little about our data and expectations, so I think we achieved our goal. Now it's time for 'a little less conversation, a little more action please'. Let's shake it!
# defining customised plots to see distribution, probability and boxplot
def diagnostic_plots(df, variable):
    # function takes a dataframe (df) and
    # the variable of interest as arguments

    # define figure size
    plt.figure(figsize=(16, 4))

    # histogram
    plt.subplot(1, 3, 1)
    sns.distplot(df[variable], bins=30)
    plt.title('Histogram')

    # Q-Q plot
    plt.subplot(1, 3, 2)
    stats.probplot(df[variable], dist="norm", plot=plt)
    plt.ylabel('Variable quantiles')

    # boxplot
    plt.subplot(1, 3, 3)
    sns.boxplot(y=df[variable])
    plt.title('Boxplot')

    plt.show()
for var in cont_vars:
    print("Train", var)
    diagnostic_plots(train, var)
    print("Test", var)
    diagnostic_plots(test, var)
# defining outlier function to find outliers
def outlier_function(df, col_name):
    '''Detects the first and third quartile and the interquartile range for a
    given column of a dataframe, then calculates upper and lower limits to
    determine outliers conservatively. Returns the lower limit, upper limit
    and number of outliers, respectively.'''
    first_quartile = np.percentile(df[col_name], 25)
    third_quartile = np.percentile(df[col_name], 75)
    IQR = third_quartile - first_quartile
    upper_limit = third_quartile + (1.5 * IQR)
    lower_limit = first_quartile - (1.5 * IQR)
    outlier_count = int(((df[col_name] < lower_limit) | (df[col_name] > upper_limit)).sum())
    return lower_limit, upper_limit, outlier_count
# finding outliers in train and test data
# loop through all continuous columns to see if there are any outliers
outlier_columns = []
for dataset in data:
    for column in cont_vars:
        lower_limit, upper_limit, outlier_count = outlier_function(dataset, column)
        if outlier_count > 0:
            outlier_columns.append(column)
            print("There are {} outliers in {}".format(outlier_count, column))
            print("Lower limit is {} & Upper limit is {} in {}".format(lower_limit, upper_limit, column))
            print('\n')
print(outlier_columns)
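The same 1.5*IQR rule can be written without the explicit value loop; a minimal vectorized sketch on toy data (not the competition columns):

```python
import pandas as pd

def iqr_limits(s: pd.Series):
    """Return (lower, upper, n_outliers) under the 1.5*IQR rule."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Boolean mask of values outside the fences, summed to a count.
    n_outliers = int(((s < lower) | (s > upper)).sum())
    return lower, upper, n_outliers

# Toy column: 999 is an obvious outlier.
s = pd.Series([10, 12, 11, 13, 12, 11, 999])
print(iqr_limits(s))  # (8.75, 14.75, 1)
```

`Series.quantile` uses the same linear interpolation as `np.percentile`, so the fences match the function above.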
# Before moving ahead, let's save our original data
train_orig = train.copy(deep=True)
test_orig = test.copy(deep=True)
Missing Data Imputation
Categorical Data Encoding.
Continuous Data Transformation to make it normalised.
Discretisation.
Outlier Handling.
Feature Selection.
train['SalePrice'].describe()
diagnostic_plots(train, 'SalePrice')
Our target variable, SalePrice, is not normally distributed.
Our target variable is right-skewed.
There are multiple outliers in the variable.
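A right-skewed target is usually handled with a log transform; this is a hedged sketch on synthetic lognormal "prices" (not the actual SalePrice column) showing how np.log1p pulls the skew back toward zero:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)

# Synthetic right-skewed "prices"; lognormal mimics a long right tail.
prices = pd.Series(rng.lognormal(mean=12, sigma=0.5, size=5_000))

print(prices.skew() > 1)                    # strongly right-skewed
print(abs(np.log1p(prices).skew()) < 0.2)   # near-symmetric after log1p
```

The same trick is the standard remedy for a skewed SalePrice before fitting linear models.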
#skewness and kurtosis
print("Skewness: " + str(train['SalePrice'].skew()))
print("Kurtosis: " + str(train['SalePrice'].kurt()))
Skewness is the degree of distortion from the symmetrical bell curve or the normal curve.
So, a symmetrical distribution will have a skewness of "0".
There are two types of Skewness: Positive and Negative.
Positive skewness (similar to our target variable's distribution) means the tail on the right side of the distribution is longer and fatter.
In positive skewness, the mean and the median are greater than the mode, as in this dataset. This means more houses were sold for less than the average price.
Negative Skewness means the tail on the left side of the distribution is longer and fatter.
In negative Skewness the mean and median will be less than the mode.
Skewness captures the imbalance of extreme values in one tail versus the other.
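The claims above are easy to verify on synthetic samples; a minimal sketch (the distributions and sample sizes are arbitrary choices):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# A symmetric sample has skewness near 0; a right-skewed (lognormal) sample
# has positive skewness, with the mean pulled above the median by the tail.
symmetric = pd.Series(rng.normal(size=10_000))
right_skewed = pd.Series(rng.lognormal(size=10_000))

print(round(symmetric.skew(), 2))                    # near 0
print(right_skewed.skew() > 0)                       # positive skew
print(right_skewed.mean() > right_skewed.median())   # mean > median
```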
from IPython.display import Image
Image("/Users/tuktuk/Downloads/1*nj-Ch3AUFmkd0JUSOW_bTQ.jpeg") # Image 2
In probability theory and statistics, kurtosis is the measure of the "tailedness" of the probability distribution of a real-valued random variable. In other words, it measures the extreme values (outliers) present in the distribution.
There are three types of Kurtosis: Mesokurtic, Leptokurtic, and Platykurtic.
Mesokurtic is similar to the normal curve, with a standard value of 3. This means that the extreme values of the distribution are similar to those of a normal distribution.
Leptokurtic distributions have fatter tails than the normal distribution; examples are t-distributions with small degrees of freedom.
Platykurtic describes a distribution with thinner tails than a normal distribution. Because such a distribution has thin tails, it has fewer outliers (e.g., extreme values three or more standard deviations from the mean) than mesokurtic and leptokurtic distributions have.
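These three regimes can be checked numerically. A hedged sketch on synthetic samples: note that pandas' `.kurt()` reports *excess* kurtosis (Fisher's definition), so a mesokurtic sample reads near 0 rather than 3; the Laplace distribution stands in here as a convenient fat-tailed example alongside the t-distributions mentioned above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

normal = pd.Series(rng.normal(size=20_000))    # mesokurtic: excess kurtosis near 0
heavy = pd.Series(rng.laplace(size=20_000))    # leptokurtic: fat tails, positive excess
uniform = pd.Series(rng.uniform(size=20_000))  # platykurtic: thin tails, negative excess

print(round(normal.kurt(), 1))
print(heavy.kurt() > normal.kurt())    # fatter tails than normal
print(uniform.kurt() < normal.kurt())  # thinner tails than normal
```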
Image("/Users/tuktuk/Downloads/KurtosisPict.jpg") # Image 3
## Getting the correlation of all the features with target variable.
(train.corr()**2)["SalePrice"].sort_values(ascending = False)[1:]
def customized_scatterplot(y, x):
    ## Sizing the plot.
    style.use('fivethirtyeight')
    plt.subplots(figsize=(12, 8))
    ## Plotting the target variable against the predictor variable
    sns.scatterplot(y=y, x=x);
for var in cont_vars:
    customized_scatterplot(train['SalePrice'], train[var])
def distributionplot(x):
    ## Sizing the plot.
    style.use('fivethirtyeight')
    plt.subplots(figsize=(12, 8))
    ## Plotting distribution plot
    sns.kdeplot(x);
### Visualising their distribution
for var in cont_vars:
    distributionplot(train[var])
discrete_vars
for var in discrete_vars:
    print(var, train[var].value_counts())
for var in discrete_vars:
    print(var, train[var].value_counts(), (train[var].value_counts() / len(train)))
def customboxplot(df, x):
    ## Sizing the plot.
    style.use('fivethirtyeight')
    plt.subplots(figsize=(12, 8))
    ## Plotting boxplot against SalePrice
    sns.boxplot(x=x, y=df.SalePrice, data=df);
for var in discrete_vars:
    customboxplot(train, var)
for var in cat_vars:
    print(var, train[var].value_counts(), (train[var].value_counts() / len(train)))
for var in cat_vars:
    customboxplot(train, var)
'In the very beginning there was nothing except for a plasma soup. What is known of these brief moments in time, at the start of our study of cosmology, is largely conjectural. However, science has devised some sketch of what probably happened, based on what is known about the universe today.' (source: http://umich.edu/~gs265/bigbang.htm)
To explore the universe, we will start with some practical recipes to make sense of our 'plasma soup':
Correlation matrix (heatmap style).
'SalePrice' correlation matrix (zoomed heatmap style).
Scatter plots between the most correlated variables (move like Jagger style).
correlation_train=train.corr()
sns.set(font_scale=2)
plt.figure(figsize = (50,35))
ax = sns.heatmap(correlation_train, annot=True,annot_kws={"size": 25},fmt='.1f',cmap='PiYG', linewidths=.5)
correlation_train = train.corr()
corr_dict=correlation_train['SalePrice'].sort_values(ascending=False).to_dict()
important_columns=[]
for key, value in corr_dict.items():
    if ((value > 0.1) & (value < 0.8)) | (value <= -0.1):
        important_columns.append(key)
important_columns
important_columns1=[]
for key, value in corr_dict.items():
    if ((value > 0.5) & (value < 0.8)) | (value <= -0.5):
        important_columns1.append(key)
important_columns1
'OverallQual', 'GrLivArea' and 'TotalBsmtSF' are strongly correlated with 'SalePrice'. Check!

'GarageCars' and 'GarageArea' are also among the most strongly correlated variables. However, the number of cars that fit into the garage is a consequence of the garage area. 'GarageCars' and 'GarageArea' are like twin brothers. You'll never be able to distinguish them. Therefore, we just need one of these variables in our analysis (we can keep 'GarageCars' since its correlation with 'SalePrice' is higher).

'TotalBsmtSF' and '1stFlrSF' also seem to be twin brothers. We can keep 'TotalBsmtSF' just to say that our first guess was right (re-read 'So... What can we expect?'). 'FullBath'?? Really? 'TotRmsAbvGrd' and 'GrLivArea', twin brothers again. Is this dataset from Chernobyl?

Ah... 'YearBuilt' and 'YearRemodAdd'. It seems that 'YearBuilt' is slightly correlated with 'SalePrice'. Honestly, it scares me to think about 'YearBuilt' because I start feeling that we should do a little bit of time-series analysis to get this right. I'll leave that as homework for you.
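Twin-brother pairs like these can also be flagged programmatically; a sketch on synthetic stand-ins (the column names echo the competition, the values are made up, and the 0.8 threshold is an arbitrary choice):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic stand-ins: GarageCars is nearly determined by GarageArea,
# while YearBuilt is unrelated to the garage pair.
n = 500
garage_area = rng.uniform(200, 900, n)
df = pd.DataFrame({
    "GarageArea": garage_area,
    "GarageCars": np.round(garage_area / 280),
    "YearBuilt": rng.integers(1900, 2010, n),
})

# Keep only the upper triangle so each pair is reported once.
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = [(a, b) for a in upper.index for b in upper.columns if upper.loc[a, b] > 0.8]
print(pairs)
```

Pairs flagged this way are candidates for dropping one member, exactly as argued above for 'GarageCars'/'GarageArea'.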
### Lets plot scatter plot for these columns (using pair plot)
#scatterplot
sns.set()
cols = ['SalePrice', 'OverallQual', 'GrLivArea', 'GarageCars', 'TotalBsmtSF', 'FullBath', 'YearBuilt', 'YearRemodAdd']
sns.pairplot(train[cols], height = 2.5)
plt.show();
One of the figures we may find interesting is the one between 'TotalBsmtSF' and 'GrLivArea'. In this figure we can see the dots drawing a linear line, which almost acts like a border. It makes sense that the majority of the dots stay below that line: basement areas can be equal to the above-ground living area, but we don't expect a basement area bigger than the above-ground living area (unless you're trying to buy a bunker).
The plot concerning 'SalePrice' and 'YearBuilt' can also make us think. In the bottom of the 'dots cloud', we see what almost appears to be a shy exponential function (be creative). We can also see this same tendency in the upper limit of the 'dots cloud' (be even more creative). Also, notice how the set of dots regarding the last years tend to stay above this limit (I just wanted to say that prices are increasing faster now).
customized_scatterplot(train.SalePrice, train.GrLivArea)
customized_scatterplot(train.SalePrice, train.GarageArea);
customized_scatterplot(train.SalePrice, train.TotalBsmtSF)
customized_scatterplot(train.SalePrice, train['1stFlrSF']);
customized_scatterplot(train.SalePrice, train.MasVnrArea);
Our target variable shows an unequal level of variance across most predictor (independent) variables. This is called heteroscedasticity (more explanation below) and is a red flag for a multiple linear regression model. There are several outliers in the scatter plots above that caught my attention.
The two on the top-right edge of SalePrice vs. GrLivArea seem to follow the trend, which can be explained by saying that as the area increased, so did the price.
However, the two on the bottom right of the same chart do not follow any trend. We will get rid of those two below.
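A common way to drop those two points is a boolean filter on GrLivArea and SalePrice; the 4000 sq ft / $300,000 cut-offs below are illustrative judgment calls, shown on toy data rather than the train set:

```python
import pandas as pd

# Toy frame: two rows mimic the bottom-right outliers (huge area, low price).
toy = pd.DataFrame({
    "GrLivArea": [1500, 2100, 4700, 5600, 1800],
    "SalePrice": [180000, 250000, 160000, 185000, 210000],
})

# Drop rows with very large living area but unusually low sale price.
# The 4000 / 300000 thresholds are illustrative, not derived from the data.
mask = (toy["GrLivArea"] > 4000) & (toy["SalePrice"] < 300000)
toy_clean = toy.drop(toy[mask].index)
print(len(toy_clean))  # 3
```

The same filter applied to `train` would remove only the points that break the area-price trend.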
Looking through the scatter plots, I realized that it is time to explain the assumptions of multiple linear regression. Before building a multiple linear regression model, we need to check that the assumptions below are valid.
Linearity ( Correct functional form )
Homoscedasticity ( Constant Error Variance )( vs Heteroscedasticity ).
Independence of Errors ( vs Autocorrelation )
Multivariate Normality ( Normality of Errors )
No or little Multicollinearity.
Since we fit a linear model, we assume that the relationship is linear and that the errors, or residuals, are pure random fluctuations around the true line. We expect that the variability in the response (dependent) variable doesn't increase as the value of the predictor (independent) variable increases, which is the assumption of equal variance, also known as homoscedasticity. We also assume that the observations are independent of one another (no autocorrelation between sequential observations) and that the predictors are not strongly correlated with each other (no multicollinearity).

Violations of these assumptions tend to occur together: if we find one of them in the dataset, it's likely that we will come across the others mentioned above. Therefore, we can find and fix several of them with a few techniques. So, how do we check regression assumptions? We fit a regression line and look at the variability of the response data along that line. Let's apply this to each one of them.

Linearity (correct functional form): linear regression needs the relationship between each independent variable and the dependent variable to be linear. The linearity assumption can be tested with scatter plots. The following two examples depict cases where little or no linearity is present.
## Plot sizing.
fig, (ax1, ax2) = plt.subplots(figsize = (12,8), ncols=2,sharey=False)
## Scatter plotting for SalePrice and GrLivArea.
sns.scatterplot( x = train.GrLivArea, y = train.SalePrice, ax=ax1)
## Putting a regression line.
sns.regplot(x=train.GrLivArea, y=train.SalePrice, ax=ax1)
## Scatter plotting for SalePrice and MasVnrArea.
sns.scatterplot(x = train.MasVnrArea,y = train.SalePrice, ax=ax2)
## regression line for MasVnrArea and SalePrice.
sns.regplot(x=train.MasVnrArea, y=train.SalePrice, ax=ax2);
plt.subplots(figsize = (12,8))
sns.residplot(x=train.GrLivArea, y=train.SalePrice);
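Beyond eyeballing the residual plot, heteroscedasticity can be quantified. A crude numpy-only sketch on synthetic data whose noise grows with x (a formal alternative would be a Breusch-Pagan test), comparing residual spread in the low-x and high-x halves:

```python
import numpy as np

rng = np.random.default_rng(7)

# Synthetic heteroscedastic data: the noise scale grows with x.
n = 1000
x = rng.uniform(1, 10, n)
y = 3 * x + rng.normal(scale=x, size=n)

# Fit a least-squares line and inspect residual spread across the x range.
slope, intercept = np.polyfit(x, y, 1)
resid = y - (slope * x + intercept)

low = resid[x < 5].std()
high = resid[x >= 5].std()
print(round(high / low, 2))  # ratio well above 1: residuals fan out with x
```

For homoscedastic data this ratio would hover near 1; here the fan-out mirrors what the SalePrice scatter plots show.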
Let's break this down.
# Homoscedasticity example
Image("/Users/tuktuk/Downloads/415147.image1.jpg") # Image 4
diagnostic_plots(train, 'SalePrice')
missing_percentage(train_test[cont_vars])
sns.distplot(train_test['LotFrontage']);
train['LotFrontage'].dtypes,test['LotFrontage'].dtypes,test['LotFrontage'].min(), test['LotFrontage'].max()
# LotFrontage - fill missing values with the mean grouped by Neighborhood, interpolate any remaining, and cast to int64
train_test['LotFrontage'] = train_test['LotFrontage'].fillna(train_test.groupby('Neighborhood')['LotFrontage'].transform('mean'))
train_test['LotFrontage'].interpolate(method='linear',inplace=True)
train_test['LotFrontage']=train_test['LotFrontage'].astype(int)
sns.distplot(train['GarageYrBlt'])
# GarageYrBlt is blank for houses with no garage, so we fill it with 0
train_test['GarageYrBlt'].fillna(0, inplace = True)
sns.distplot(train['MasVnrArea'])
train['MasVnrArea'].dtypes,test['MasVnrArea'].dtypes,test['MasVnrArea'].min(), test['MasVnrArea'].max()
# Let's fill it with the mean value per MasVnrType, interpolate any remaining, and cast to int64
train_test['MasVnrArea'] = train_test['MasVnrArea'].fillna(train_test.groupby('MasVnrType')['MasVnrArea'].transform('mean'))
train_test['MasVnrArea'].interpolate(method='linear',inplace=True)
train_test['MasVnrArea']=train_test['MasVnrArea'].astype(int)
The remaining continuous variables each have a single missing value (0.07%):
GarageArea - Size of garage in square feet
TotalBsmtSF - Total square feet of basement area
BsmtFinSF1 - Type 1 finished square feet
BsmtFinSF2 - Type 2 finished square feet
BsmtUnfSF - Unfinished square feet of basement area
# These missing values point to houses without a basement or garage; let's fill them with 0
cols = ['GarageArea','TotalBsmtSF','BsmtFinSF1', 'BsmtFinSF2','BsmtUnfSF' ]
for col in cols:
    train_test[col].fillna(0, inplace=True)
discrete_vars
missing_percentage(train_test[discrete_vars])
# Seeing the properties of these variables, we can conclude that they can be filled with 0
cols = ['BsmtHalfBath', 'BsmtFullBath', 'GarageCars']
for col in cols:
    train_test[col].fillna(0, inplace=True)
missing_percentage(train_test[cat_vars])
PoolQC - Pool quality (NaN means no pool)
MiscFeature - Miscellaneous feature not covered in other categories (NaN means 0)
Alley - Type of alley access (NaN means no Alley Access)
Fence - Fence quality ( NaN means Fence not available)
FireplaceQu - Fireplace quality (NaN means no fireplace)
GarageCond, GarageQual, GarageFinish & GarageType have 81 missing values in the train data and 78 in the test data (GarageType 76); for these four variables, NaN means the garage is absent.
Similarly, for BsmtCond, BsmtQual, BsmtExposure, BsmtFinType2 and BsmtFinType1, NaN means the basement is absent.
MasVnrType- Masonry veneer type (NaN means Masonry veneer absent)
Electrical - Electrical system (it cannot truly be missing; let's fill it with the most frequent value)
MSZoning - The general zoning classification (cannot be missing; let's fill it with the most frequent value)
Utilities - Type of utilities available (NaN means no utility)
Functional - Home functionality rating (NaN means not available)
KitchenQual - Kitchen quality (NaN means not available; can be filled with the most frequent value)
SaleType - Type of sale (NaN means not available; can be filled with the most frequent value)
Exterior2nd - Exterior covering on house (if more than one material) (NaN means not available)
Exterior1st - Exterior covering on house (NaN means not available)
# Let's fill them as per the insight above
cols = ['PoolQC', 'MiscFeature','Alley' ,'Fence', 'GarageCond' , 'GarageQual',
'GarageFinish','GarageType','BsmtCond' ,'BsmtQual', 'BsmtExposure' , 'BsmtFinType2' ,
'BsmtFinType1','MasVnrType', 'FireplaceQu']
for col in cols:
    train_test[col].fillna('None', inplace=True)
cols = ['Electrical', 'KitchenQual', 'Utilities', 'MSZoning', 'Functional', 'SaleType',
        'Exterior1st', 'Exterior2nd']
for col in cols:
    train_test[col].fillna(train_test[col].mode()[0], inplace = True)
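The notes above suggest filling MSZoning, KitchenQual and SaleType with the most frequent value within each neighbourhood, whereas the loop above falls back to the global mode. A per-neighbourhood fill could be sketched as below (illustrative only: the toy frame and its values are made up, not taken from the Ames data):

```python
import pandas as pd

# Hypothetical toy frame standing in for train_test
df = pd.DataFrame({
    'Neighborhood': ['NAmes', 'NAmes', 'NAmes', 'IDOTRR', 'IDOTRR', 'IDOTRR'],
    'MSZoning':     ['RL', 'RL', None, 'RM', None, 'RM'],
})
# Fill each missing MSZoning with the mode of its own neighbourhood
df['MSZoning'] = df.groupby('Neighborhood')['MSZoning'] \
                   .transform(lambda s: s.fillna(s.mode()[0]))
print(df['MSZoning'].tolist())  # ['RL', 'RL', 'RL', 'RM', 'RM', 'RM']
```

The `transform` call returns a Series aligned to the original index, so the assignment is a drop-in replacement for the global-mode fill.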
train_test.isna().sum()
train_test.columns
Year built + year remodeled
Total square feet by adding basement, first floor and second floor
Total number of baths
Total porch area by adding open porch, 3-season porch, enclosed porch, screen porch and wood deck
House age 1 = (year sold - year built) + 1
House age 2 = (year sold - year remodeled) + 2
train_test['YrBltRmd']=train_test['YearBuilt']+train_test['YearRemodAdd']
train_test['Total_Square_Feet'] = (train_test['BsmtFinSF1'] + train_test['BsmtFinSF2'] + train_test['1stFlrSF'] + train_test['2ndFlrSF'] + train_test['TotalBsmtSF'])
train_test['Total_Bath'] = (train_test['FullBath'] + (0.5 * train_test['HalfBath']) + train_test['BsmtFullBath'] + (0.5 * train_test['BsmtHalfBath']))
train_test['Total_Porch_Area'] = (train_test['OpenPorchSF'] + train_test['3SsnPorch'] + train_test['EnclosedPorch'] + train_test['ScreenPorch'] + train_test['WoodDeckSF'])
train_test['House_Age1']=train_test['YrSold']-train_test['YearBuilt']+1
train_test['House_Age2']=train_test['YrSold']-train_test['YearRemodAdd']+2
train_test['exists_pool'] = train_test['PoolArea'].apply(lambda x: 1 if x > 0 else 0)
train_test['exists_IstFl'] = train_test['1stFlrSF'].apply(lambda x: 1 if x > 0 else 0)
train_test['exists_2ndFl'] = train_test['2ndFlrSF'].apply(lambda x: 1 if x > 0 else 0)
train_test['exists_garage'] = train_test['GarageArea'].apply(lambda x: 1 if x > 0 else 0)
train_test['exists_fireplace'] = train_test['Fireplaces'].apply(lambda x: 1 if x > 0 else 0)
train_test['exists_bsmt'] = train_test['TotalBsmtSF'].apply(lambda x: 1 if x > 0 else 0)
train_test['old_house'] = train_test['YearBuilt'].apply(lambda x: 1 if x <1990 else 0)
(train.dtypes).value_counts(), (test.dtypes).value_counts(), (train_test.dtypes).value_counts()
#### We can see that out of total 47 numerical variables train has 34 int64 & 3 float64 types, whereas
#### test has 25 int64 and 11 float64
# Let's correct this first
for i in train_test.columns:
    if 'SalePrice' not in i:
        if 'object' in str(train_test[i].dtype):
            train_test[i] = train_test[i].fillna(method='ffill')
(train_test.dtypes).value_counts()
### Categorical variables
cat_vars = [var for var in train_test.columns if train_test[var].dtypes=='O']
print('Number of categorical variables:', len(cat_vars))
print('Categorical variables:', cat_vars)
from sklearn.preprocessing import LabelEncoder
# Encoding using LabelEncoder
for col in cat_vars:
    lbl_enc = LabelEncoder()
    lbl_enc.fit(list(train_test[col].values))
    train_test[col] = lbl_enc.transform(list(train_test[col].values))
(train_test.dtypes).value_counts()
numeric_features = train_test.dtypes[train_test.dtypes != "object"].index
skewed_features = train_test[numeric_features].apply(lambda x: skew(x)).sort_values(ascending=False)
print(skewed_features)
high_skewness = skewed_features[abs(skewed_features) > 0.9]
skewed_features = high_skewness.index
print(high_skewness)
print('\nVariables with high skewness: \n\n',skewed_features)
train_test[skewed_features].head()
for feature in skewed_features:
    train_test[feature] = boxcox1p(train_test[feature], boxcox_normmax(train_test[feature] + 1))
train_test[skewed_features].head()
train_test=pd.get_dummies(train_test,dtype='int8')
train_test.isnull().sum(),(train_test.dtypes).value_counts()
# .copy() avoids SettingWithCopyWarning when we modify these slices below
train = train_test[0:1460].copy()
test = train_test[1460:2919].copy()
# using interpolate for equal space (linear)
train.interpolate(method='linear',inplace=True)
test.interpolate(method='linear',inplace=True)
# Finding the top 30 variables by correlation with SalePrice, in descending order
corr_new_train=train.corr()
plt.figure(figsize=(5,15))
sns.heatmap(corr_new_train[['SalePrice']].sort_values(by=['SalePrice'],ascending=False).head(30),annot_kws={"size": 16},vmin=-1, cmap='PiYG', annot=True)
sns.set(font_scale=2)
corr_dict2=corr_new_train['SalePrice'].sort_values(ascending=False).to_dict()
corr_dict2
# Finding best variables as per correlation
best_columns=[]
for key, value in corr_dict2.items():
    if ((value >= 0.30) & (value < 0.9)) | (value <= -0.30):
        best_columns.append(key)
best_columns
print(len(best_columns))
# As SalePrice is right-skewed, make a new column with its logarithmic transformation (log1p)
train['SalePrice_Log1p'] = np.log1p(train.SalePrice)
print(min(train['SalePrice_Log1p']))
print(max(train['SalePrice_Log1p']))
plt.figure(figsize=(10,8))
sns.set(font_scale=1.2)
sns.distplot(train['SalePrice'],color='violet')
plt.xlabel('SalePrice',fontsize=20)
print('Skew Dist:',train['SalePrice'].skew())
print('Kurtosis Dist:',train['SalePrice'].kurt())
plt.figure(figsize=(10,8))
sns.set(font_scale=1.2)
sns.distplot(train['SalePrice_Log1p'],color='indigo')
plt.xlabel('SalePrice_Log1p',fontsize=20)
print('Skew Dist:',train['SalePrice_Log1p'].skew())
print('Kurtosis Dist:',train['SalePrice_Log1p'].kurt())
DBSCAN is a density-based clustering approach, not an outlier detection method per se. It grows clusters based on a distance measure: core points (points that have a minimum number of points in their surroundings) and points that are close enough to those core points together form a cluster.
We can use DBSCAN as an outlier detection algorithm because points that do not belong to any cluster get their own label: -1. The algorithm has two parameters (epsilon: the length scale, and min_samples: the minimum number of neighbours required for a point to be a core point). Finding a good epsilon is critical.
DBSCAN thus makes binary predictions: a point is either an outlier or not. To refine the predictions, we also treat clusters other than the main cluster as outlier clusters; the smaller the cluster, the higher the outlier score.
The distance function used will be the default Euclidean distance. Note that the worst-case performance of DBSCAN is O(n^2) if the neighbourhood scan is a linear scan, which is the case for the scikit-learn implementation. This significantly limits the dataset size that can be analyzed.
One can also pass a distance matrix instead of a matrix of data points to the algorithm, which should reduce the time complexity. However, no speed-up was observed, while the memory load was significant (despite the matrix being sparse), so this is not done.
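Before running DBSCAN on the housing data, its noise-labelling behaviour can be illustrated on a small synthetic example (the data below are made up, and the eps/min_samples values are arbitrary choices for this toy set):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.RandomState(0)
# One dense cluster around the origin plus two isolated points
X_toy = np.vstack([rng.normal(0, 0.3, size=(50, 2)),
                   [[5.0, 5.0], [6.0, -6.0]]])
toy_labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X_toy)
print(toy_labels[-2:])  # the two isolated points are labelled -1 (noise/outliers)
```

The dense points all receive a non-negative cluster label, while the two isolated points fall outside every epsilon-neighbourhood and get the noise label -1; this is exactly the label we will filter on below.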
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors
from sklearn.decomposition import PCA
from sklearn.preprocessing import RobustScaler
plt.style.use('ggplot')
plt.figure(figsize=(20,12))
rbst_scaler=RobustScaler()
train_rbst=rbst_scaler.fit_transform(train)
pca=PCA(50).fit(train_rbst)
plt.plot(pca.explained_variance_ratio_.cumsum())
plt.xticks(np.arange(0, 50, 1))
plt.xlabel('Number of components',fontweight='bold',size=14)
plt.ylabel('Explanined variance ratio for number of components',fontweight='bold',size=14)
train_pca=PCA(3).fit_transform(train_rbst)
neigh = NearestNeighbors(n_neighbors=2)
nbrs = neigh.fit(train)
distances, indices = nbrs.kneighbors(train)
distances = np.sort(distances, axis=0)
distances = distances[:,1]
plt.figure(figsize=(15,15))
plt.plot(distances)
dbscan = DBSCAN(eps=1400, min_samples=20).fit(train_pca)
core_samples_mask = np.zeros_like(dbscan.labels_, dtype=bool)
core_samples_mask[dbscan.core_sample_indices_] = True
labels=dbscan.labels_
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
unique_labels = set(labels)
plt.figure(figsize=(12,12))
colors = [plt.cm.prism(each) for each in np.linspace(0, 1, len(unique_labels))]
for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black used for noise.
        col = [0, 0, 0, 1]
    class_member_mask = (labels == k)
    xy = train_pca[class_member_mask & core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=14)
    xy = train_pca[class_member_mask & ~core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=6)
plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()
labels=pd.DataFrame(labels,columns=['Classes'])
print(labels[labels['Classes']==-1])
train=pd.concat([train,labels],axis=1)
train[train.Classes==-1]
# As these are the outliers, let's drop them
train.drop([197,810,1170,1182,1298,1386,1423],axis=0,inplace=True)
train.shape, test.shape
plt.style.use('dark_background')
fig, axes = plt.subplots(18, 2,figsize=(20,80))
fig.subplots_adjust(hspace=0.6)
colors=[plt.cm.prism_r(each) for each in np.linspace(0, 1, len(best_columns))]
for i, ax, color in zip(best_columns, axes.flatten(), colors):
    sns.regplot(x=train[i], y=train["SalePrice"], fit_reg=True, marker='o', scatter_kws={'s':50,'alpha':0.8}, color=color, ax=ax)
    ax.set_xlabel(i, fontsize=12)
    ax.set_ylabel('SalePrice', fontsize=12)
    ax.set_yticks(np.arange(0, 900001, 100000))
    ax.set_title('SalePrice' + ' - ' + str(i), color=color, fontweight='bold', size=20)
plt.style.use('ggplot')
fig, axes = plt.subplots(18, 2,figsize=(20,60))
fig.subplots_adjust(hspace=0.8)
sns.set(font_scale=1.2)
colors=[plt.cm.prism_r(each) for each in np.linspace(0, 1, len(best_columns))]
for i, ax, color in zip(best_columns, axes.flatten(), colors):
    sns.regplot(x=train[i], y=train["SalePrice_Log1p"], fit_reg=True, marker='o', scatter_kws={'s':50,'alpha':0.7}, color=color, ax=ax)
    ax.set_xlabel(i, fontsize=12)
    ax.set_ylabel('SalePrice_Log1p', fontsize=12)
    ax.set_title('SalePrice_Log1p' + ' - ' + str(i), color=color, fontweight='bold', size=20)
best_columns
# 'OverallQual', 'Total_Square_Feet', 'GrLivArea', 'GarageCars'
plt.style.use('dark_background')
fig = plt.figure(figsize=(12,10))
#fig.subplots_adjust(hspace=1.2)
plt.subplot(2,2,1)
sns.regplot(x=train['OverallQual'], y=train["SalePrice_Log1p"], fit_reg=True,marker='o',scatter_kws={'s':50,'alpha':0.7})
plt.subplot(2,2,2)
sns.regplot(x=train['Total_Square_Feet'], y=train["SalePrice_Log1p"], fit_reg=True,marker='o',scatter_kws={'s':50,'alpha':0.7})
plt.subplot(2,2,3)
sns.regplot(x=train['GrLivArea'], y=train["SalePrice_Log1p"], fit_reg=True,marker='o',scatter_kws={'s':50,'alpha':0.7})
plt.subplot(2,2,4)
sns.regplot(x=train['GarageCars'], y=train["SalePrice_Log1p"], fit_reg=True,marker='o',scatter_kws={'s':50,'alpha':0.7})
plt.show();
# single outlier in Total_Square_Feet
train['Total_Square_Feet'].max()
train = train[train['Total_Square_Feet']<28.97]
# 'Total_Bath', 'GarageArea', 'TotalBsmtSF', '1stFlrSF'
plt.style.use('dark_background')
fig = plt.figure(figsize=(12,10))
#fig.subplots_adjust(hspace=1.2)
plt.subplot(2,2,1)
sns.regplot(x=train['Total_Bath'], y=train["SalePrice_Log1p"], fit_reg=True,marker='o',scatter_kws={'s':50,'alpha':0.7})
plt.subplot(2,2,2)
sns.regplot(x=train['GarageArea'], y=train["SalePrice_Log1p"], fit_reg=True,marker='o',scatter_kws={'s':50,'alpha':0.7})
plt.subplot(2,2,3)
sns.regplot(x=train['TotalBsmtSF'], y=train["SalePrice_Log1p"], fit_reg=True,marker='o',scatter_kws={'s':50,'alpha':0.7})
plt.subplot(2,2,4)
sns.regplot(x=train['1stFlrSF'], y=train["SalePrice_Log1p"], fit_reg=True,marker='o',scatter_kws={'s':50,'alpha':0.7})
plt.show();
#train['GarageArea'].sort_values(ascending = False)
train = train[train['Total_Bath']<5]
train = train[train['GarageArea']<1248]
# 'YrBltRmd', 'FullBath', 'TotRmsAbvGrd', 'YearBuilt'
plt.style.use('dark_background')
fig = plt.figure(figsize=(12,10))
#fig.subplots_adjust(hspace=1.2)
plt.subplot(2,2,1)
sns.regplot(x=train['YrBltRmd'], y=train["SalePrice_Log1p"], fit_reg=True,marker='o',scatter_kws={'s':50,'alpha':0.7})
plt.subplot(2,2,2)
sns.regplot(x=train['FullBath'], y=train["SalePrice_Log1p"], fit_reg=True,marker='o',scatter_kws={'s':50,'alpha':0.7})
plt.subplot(2,2,3)
sns.regplot(x=train['TotRmsAbvGrd'], y=train["SalePrice_Log1p"], fit_reg=True,marker='o',scatter_kws={'s':50,'alpha':0.7})
plt.subplot(2,2,4)
sns.regplot(x=train['YearBuilt'], y=train["SalePrice_Log1p"], fit_reg=True,marker='o',scatter_kws={'s':50,'alpha':0.7})
plt.show();
# 'YearRemodAdd', 'GarageYrBlt', 'exists_fireplace', 'Fireplaces'
plt.style.use('dark_background')
fig = plt.figure(figsize=(12,10))
#fig.subplots_adjust(hspace=1.2)
plt.subplot(2,2,1)
sns.regplot(x=train['YearRemodAdd'], y=train["SalePrice_Log1p"], fit_reg=True,marker='o',scatter_kws={'s':50,'alpha':0.7},color='r')
plt.subplot(2,2,2)
sns.regplot(x=train['GarageYrBlt'], y=train["SalePrice_Log1p"], fit_reg=True,marker='o',scatter_kws={'s':50,'alpha':0.7}, color='g')
plt.subplot(2,2,3)
sns.regplot(x=train['exists_fireplace'], y=train["SalePrice_Log1p"], fit_reg=True,marker='o',scatter_kws={'s':50,'alpha':0.7},color='b')
plt.subplot(2,2,4)
sns.regplot(x=train['Fireplaces'], y=train["SalePrice_Log1p"], fit_reg=True,marker='o',scatter_kws={'s':50,'alpha':0.7})
plt.show();
# 'MasVnrArea', 'OpenPorchSF', 'Total_Porch_Area', 'LotArea'
plt.style.use('dark_background')
fig = plt.figure(figsize=(12,10))
#fig.subplots_adjust(hspace=1.2)
plt.subplot(2,2,1)
sns.regplot(x=train['MasVnrArea'], y=train["SalePrice_Log1p"], fit_reg=True,marker='o',scatter_kws={'s':50,'alpha':0.7},color='r')
plt.subplot(2,2,2)
sns.regplot(x=train['OpenPorchSF'], y=train["SalePrice_Log1p"], fit_reg=True,marker='o',scatter_kws={'s':50,'alpha':0.7}, color='g')
plt.subplot(2,2,3)
sns.regplot(x=train['Total_Porch_Area'], y=train["SalePrice_Log1p"], fit_reg=True,marker='o',scatter_kws={'s':50,'alpha':0.7},color='b')
plt.subplot(2,2,4)
sns.regplot(x=train['LotArea'], y=train["SalePrice_Log1p"], fit_reg=True,marker='o',scatter_kws={'s':50,'alpha':0.7})
plt.show();
# 'Foundation', 'LotFrontage', 'WoodDeckSF', 'BsmtFinSF1'
plt.style.use('dark_background')
fig = plt.figure(figsize=(12,10))
#fig.subplots_adjust(hspace=1.2)
plt.subplot(2,2,1)
sns.regplot(x=train['Foundation'], y=train["SalePrice_Log1p"], fit_reg=True,marker='o',scatter_kws={'s':50,'alpha':0.7},color='r')
plt.subplot(2,2,2)
sns.regplot(x=train['LotFrontage'], y=train["SalePrice_Log1p"], fit_reg=True,marker='o',scatter_kws={'s':50,'alpha':0.7}, color='g')
plt.subplot(2,2,3)
sns.regplot(x=train['WoodDeckSF'], y=train["SalePrice_Log1p"], fit_reg=True,marker='o',scatter_kws={'s':50,'alpha':0.7},color='b')
plt.subplot(2,2,4)
sns.regplot(x=train['BsmtFinSF1'], y=train["SalePrice_Log1p"], fit_reg=True,marker='o',scatter_kws={'s':50,'alpha':0.7})
plt.show();
# 'SaleCondition', '2ndFlrSF', 'BsmtExposure', 'HeatingQC'
plt.style.use('dark_background')
fig = plt.figure(figsize=(12,10))
#fig.subplots_adjust(hspace=1.2)
plt.subplot(2,2,1)
sns.regplot(x=train['SaleCondition'], y=train["SalePrice_Log1p"], fit_reg=True,marker='o',scatter_kws={'s':50,'alpha':0.7},color='r')
plt.subplot(2,2,2)
sns.regplot(x=train['2ndFlrSF'], y=train["SalePrice_Log1p"], fit_reg=True,marker='o',scatter_kws={'s':50,'alpha':0.7}, color='g')
plt.subplot(2,2,3)
sns.regplot(x=train['BsmtExposure'], y=train["SalePrice_Log1p"], fit_reg=True,marker='o',scatter_kws={'s':50,'alpha':0.7},color='b')
plt.subplot(2,2,4)
sns.regplot(x=train['HeatingQC'], y=train["SalePrice_Log1p"], fit_reg=True,marker='o',scatter_kws={'s':50,'alpha':0.7})
plt.show();
# 'GarageType', 'GarageFinish', 'House_Age2', 'House_Age1'
plt.style.use('dark_background')
fig = plt.figure(figsize=(12,10))
#fig.subplots_adjust(hspace=1.2)
plt.subplot(2,2,1)
sns.regplot(x=train['GarageType'], y=train["SalePrice_Log1p"], fit_reg=True,marker='o',scatter_kws={'s':50,'alpha':0.7},color='r')
plt.subplot(2,2,2)
sns.regplot(x=train['GarageFinish'], y=train["SalePrice_Log1p"], fit_reg=True,marker='o',scatter_kws={'s':50,'alpha':0.7}, color='g')
plt.subplot(2,2,3)
sns.regplot(x=train['House_Age2'], y=train["SalePrice_Log1p"], fit_reg=True,marker='o',scatter_kws={'s':50,'alpha':0.7},color='b')
plt.subplot(2,2,4)
sns.regplot(x=train['House_Age1'], y=train["SalePrice_Log1p"], fit_reg=True,marker='o',scatter_kws={'s':50,'alpha':0.7})
plt.show();
# 'old_house', 'KitchenQual', 'BsmtQual', 'ExterQual'
plt.style.use('dark_background')
fig = plt.figure(figsize=(12,10))
#fig.subplots_adjust(hspace=1.2)
plt.subplot(2,2,1)
sns.regplot(x=train['old_house'], y=train["SalePrice_Log1p"], fit_reg=True,marker='o',scatter_kws={'s':50,'alpha':0.7},color='r')
plt.subplot(2,2,2)
sns.regplot(x=train['KitchenQual'], y=train["SalePrice_Log1p"], fit_reg=True,marker='o',scatter_kws={'s':50,'alpha':0.7}, color='g')
plt.subplot(2,2,3)
sns.regplot(x=train['BsmtQual'], y=train["SalePrice_Log1p"], fit_reg=True,marker='o',scatter_kws={'s':50,'alpha':0.7},color='b')
plt.subplot(2,2,4)
sns.regplot(x=train['ExterQual'], y=train["SalePrice_Log1p"], fit_reg=True,marker='o',scatter_kws={'s':50,'alpha':0.7})
plt.show();
train.shape, test.shape
### Let's check the correlation now, without the outliers
plt.style.use('dark_background')
corr1_new_train=train.corr()
plt.figure(figsize=(5,15))
sns.heatmap(corr1_new_train[['SalePrice']].sort_values(by=['SalePrice'],ascending=False).head(25),annot_kws={"size": 16},vmin=-1, cmap='PiYG', annot=True)
sns.set(font_scale=2)
del test['SalePrice']
# making X and y
X=train.drop(['SalePrice','SalePrice_Log1p','Classes'],axis=1)
y=train.SalePrice_Log1p
for i in X.columns:
    counts = X[i].value_counts()
    zeros = counts.iloc[0]  # frequency of the most common value
    print(i, '\n', zeros)
# Finding variables which can create overfitting
def overfit_reducer(df):
    """Return the columns where a single value accounts for more than 99.9% of the rows."""
    overfit = []
    for i in df.columns:
        counts = df[i].value_counts()
        zeros = counts.iloc[0]  # frequency of the most common value
        if zeros / len(df) * 100 > 99.9:
            overfit.append(i)
    return overfit
overfitted_features = overfit_reducer(X)
overfitted_features
### Let's drop them from X and test
X.drop(['Utilities', 'PoolArea', 'PoolQC', 'exists_pool', 'exists_IstFl'], axis=1, inplace = True)
test.drop(['Utilities', 'PoolArea', 'PoolQC', 'exists_pool', 'exists_IstFl'], axis=1, inplace = True)
print(X.shape)
print(test.shape)
### Scaling
from sklearn.preprocessing import StandardScaler,RobustScaler,LabelEncoder,PowerTransformer
std_scaler=StandardScaler()
rbst_scaler=RobustScaler()
power_transformer=PowerTransformer()
X_std=std_scaler.fit_transform(X)
X_rbst=rbst_scaler.fit_transform(X)
X_pwr=power_transformer.fit_transform(X)
test_std=std_scaler.transform(test)
test_rbst=rbst_scaler.transform(test)
test_pwr=power_transformer.transform(test)
### Splitting the data in X_train,X_test,y_train,y_test
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X_std,y,test_size=0.2,random_state=52)
print('X_train Shape :',X_train.shape)
print('X_test Shape :',X_test.shape)
print('y_train Shape :',y_train.shape)
print('y_test Shape :',y_test.shape)
# Import common MLA libraries
from sklearn import svm, tree, linear_model, neighbors, naive_bayes, ensemble, discriminant_analysis, gaussian_process
from xgboost import XGBClassifier, XGBRegressor
from sklearn.neural_network import MLPRegressor
# Regression problem
MLA = [
# Linear Regression
linear_model.LinearRegression(),
linear_model.Ridge(),
# Decision Tree
tree.DecisionTreeRegressor(random_state = 0),
#Ensemble Methods
ensemble.BaggingRegressor(random_state = 0),
ensemble.ExtraTreesRegressor(random_state = 0),
ensemble.GradientBoostingRegressor(random_state = 0),
ensemble.RandomForestRegressor(random_state = 0),
#Nearest Neighbor
neighbors.KNeighborsRegressor(),
#xgboost
XGBRegressor(random_state = 0),
]
#create table to compare MLA metrics
MLA_columns = ['MLA Name', 'MLA Parameters','MLA Train Accuracy', 'MLA Test Accuracy', 'MLA RMSE', 'MLA Time']
MLA_compare = pd.DataFrame(columns = MLA_columns)
row_index = 0
import time
from sklearn.metrics import mean_squared_error

for alg in MLA:
    # index through MLA and save performance to table
    # set name and parameters
    MLA_name = alg.__class__.__name__
    MLA_compare.loc[row_index, 'MLA Name'] = MLA_name
    MLA_compare.loc[row_index, 'MLA Parameters'] = str(alg.get_params())
    # score model (and time the fit)
    start = time.perf_counter()
    fitted_alg = alg.fit(X_train, y_train)
    MLA_compare.loc[row_index, 'MLA Time'] = time.perf_counter() - start
    MLA_compare.loc[row_index, 'MLA Train Accuracy'] = fitted_alg.score(X_train, y_train)
    MLA_compare.loc[row_index, 'MLA Test Accuracy'] = fitted_alg.score(X_test, y_test)
    # RMSE score
    MLA_compare.loc[row_index, 'MLA RMSE'] = np.sqrt(mean_squared_error(y_test, alg.predict(X_test)))
    row_index += 1
#print and sort table
MLA_compare.sort_values(by = ['MLA RMSE'], ascending = True, inplace = True)
MLA_compare
from sklearn.model_selection import RandomizedSearchCV, KFold,GridSearchCV
from sklearn.metrics import r2_score,mean_absolute_error,mean_squared_error
# RidgeCV
kfolds = KFold(n_splits=10, shuffle=True, random_state=42)
alphas=[1e-9,1e-8,1e-7,1e-6]
ridgecv_reg= make_pipeline(RidgeCV(alphas=alphas, cv=kfolds))
ridgecv_reg.fit(X_train, y_train)
y_head=ridgecv_reg.predict(X_test)
print('-'*10+'RidgeCV'+'-'*10)
print('R squared: ', r2_score(y_test, y_head))
print('Mean Absolute Error: ', mean_absolute_error(y_test, y_head))
print('Mean Squared Error: ', mean_squared_error(y_test, y_head))
import lightgbm as lgb
import xgboost as xgb
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RandomizedSearchCV, KFold,GridSearchCV
from sklearn.metrics import r2_score,mean_absolute_error,mean_squared_error
from sklearn.preprocessing import StandardScaler,RobustScaler,LabelEncoder,PowerTransformer
from sklearn.ensemble import GradientBoostingRegressor,StackingRegressor, RandomForestRegressor, ExtraTreesRegressor
from sklearn.linear_model import ElasticNetCV, LassoCV, RidgeCV
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeRegressor
#lightgbm
lgb_regressor=lgb.LGBMRegressor(objective='regression', num_leaves=5, learning_rate=0.035, n_estimators=2177, max_bin=50, bagging_fraction=0.65,bagging_freq=5, bagging_seed=7,
feature_fraction=0.201, feature_fraction_seed=7,n_jobs=-1)
lgb_regressor.fit(X_train, y_train)
y_head=lgb_regressor.predict(X_test)
print('-'*10+'LGBM'+'-'*10)
print('R squared: ', r2_score(y_test, y_head))
print('Mean Absolute Error: ', mean_absolute_error(y_test, y_head))
print('Mean Squared Error: ', mean_squared_error(y_test, y_head))
#GradientBoostingRegressor
gb_reg = GradientBoostingRegressor(n_estimators=1992, learning_rate=0.03005, max_depth=4, max_features='sqrt', min_samples_leaf=15, min_samples_split=14, loss='huber', random_state =42)
gb_reg.fit(X_train, y_train)
y_head=gb_reg.predict(X_test)
print('-'*10+'GBR'+'-'*10)
print('R squared: ', r2_score(y_test, y_head))
print('Mean Absolute Error: ', mean_absolute_error(y_test, y_head))
print('Mean Squared Error: ', mean_squared_error(y_test, y_head))
#LassoCV
kfolds = KFold(n_splits=8, shuffle=True, random_state=42)
lassocv_reg= make_pipeline(LassoCV(alphas=alphas, cv=kfolds))
lassocv_reg.fit(X_train, y_train)
y_head=lassocv_reg.predict(X_test)
print('-'*10+'LassoCV'+'-'*10)
print('R squared: ', r2_score(y_test, y_head))
print('Mean Absolute Error: ', mean_absolute_error(y_test, y_head))
print('Mean Squared Error: ', mean_squared_error(y_test, y_head))
# ElasticNetCV
kfolds = KFold(n_splits=8, shuffle=True, random_state=42)
alphas=[0.0001, 0.0002, 0.0003, 0.0004, 0.0005, 0.0006]
l1ratio=[0.87, 0.9,0.92, 0.95,0.97, 0.99, 1]
elasticv_reg= make_pipeline(ElasticNetCV(alphas=alphas, cv=kfolds, l1_ratio=l1ratio))
elasticv_reg.fit(X_train, y_train)
y_head=elasticv_reg.predict(X_test)
print('-'*10+'ElasticNetCV'+'-'*10)
print('R squared: ', r2_score(y_test, y_head))
print('Mean Absolute Error: ', mean_absolute_error(y_test, y_head))
print('Mean Squared Error: ', mean_squared_error(y_test, y_head))
#StackingRegressor
estimators = [ ('gbr', gb_reg),
('lasso', lassocv_reg),
('ridge', ridgecv_reg),
('elasticnet', elasticv_reg)]
stack_reg=StackingRegressor(estimators=estimators,final_estimator=ExtraTreesRegressor(n_estimators=50),n_jobs=-1)
stack_reg.fit(X_train, y_train)
y_head=stack_reg.predict(X_test)
print('-'*10+'StackingRegressor'+'-'*10)
print('R squared: ', r2_score(y_test, y_head))
print('Mean Absolute Error: ', mean_absolute_error(y_test, y_head))
print('Mean Squared Error: ', mean_squared_error(y_test, y_head))
# making dataframe with prediction
y_head=pd.DataFrame(y_head,columns=['Predict'])
y_test.reset_index(drop=True,inplace=True)
y_test_y_head=pd.concat([y_test,y_head],axis=1)
y_test_y_head.head()
# predicting test data with all these models
test_pred_lgb=lgb_regressor.predict(test_pwr)
test_pred_gb=gb_reg.predict(test_pwr)
test_pred_elastic=elasticv_reg.predict(test_pwr)
test_pred_ridge=ridgecv_reg.predict(test_pwr)
test_pred_lasso=lassocv_reg.predict(test_pwr)
test_pred_stack=stack_reg.predict(test_pwr)
# making dataframe with these predictions
test_pred_lgb=pd.DataFrame(test_pred_lgb,columns=['SalePrice'])
test_pred_gb=pd.DataFrame(test_pred_gb,columns=['SalePrice'])
test_pred_elastic=pd.DataFrame(test_pred_elastic,columns=['SalePrice'])
test_pred_ridge=pd.DataFrame(test_pred_ridge,columns=['SalePrice'])
test_pred_lasso=pd.DataFrame(test_pred_lasso,columns=['SalePrice'])
test_pred_stack=pd.DataFrame(test_pred_stack,columns=['SalePrice'])
# Taking the inverse transform (expm1) of the log1p predictions to recover SalePrice on the original scale
test_pred_lgb.SalePrice =np.floor(np.expm1(test_pred_lgb.SalePrice))
test_pred_gb.SalePrice =np.floor(np.expm1(test_pred_gb.SalePrice))
test_pred_elastic.SalePrice =np.floor(np.expm1(test_pred_elastic.SalePrice))
test_pred_ridge.SalePrice =np.floor(np.expm1(test_pred_ridge.SalePrice))
test_pred_lasso.SalePrice =np.floor(np.expm1(test_pred_lasso.SalePrice))
test_pred_stack.SalePrice =np.floor(np.expm1(test_pred_stack.SalePrice))
# taking average of all models
final_pred=(test_pred_stack+test_pred_lgb+test_pred_ridge+test_pred_gb + test_pred_elastic + test_pred_lasso)/6
final_pred.head()
# Let's check accuracy on Kaggle's test data by submitting the predictions as a submission file
y_pred_test = final_pred
temp = pd.DataFrame(pd.read_csv("/Users/tuktuk/Downloads/house-prices-advanced-regression-techniques/test.csv")['Id'])
temp['SalePrice'] = y_pred_test
temp.to_csv("/Users/tuktuk/Downloads/house-prices-advanced-regression-techniques/submission23Sep1.csv", index = False)
Image("/Users/tuktuk/Desktop/Screenshot 2020-09-23 at 10.28.02 PM.png")
I would like to discuss the linear regression algorithms (Linear, Ridge, Lasso, ElasticNet) for a better understanding. This way I review what I know and at the same time help out the community. If you already know enough about linear regression, you may skip this part and go straight to the part where I fit the model. However, if you take your time to read this and the other model description sections and let me know how I am doing, I would genuinely appreciate it. Let's get started.
We will start with one of the most basic but useful machine learning models, Linear Regression. However, do not let the simplicity of this model fool you, as Linear Regression is the basis of some of the most complex models out there. For the sake of understanding this model, we will use only two features, SalePrice and GrLivArea. Let's take a sample of the data and graph it.
sample_train = train_orig.sample(300)
import seaborn as sns
plt.subplots(figsize = (15,8))
ax = plt.gca()
ax.scatter(sample_train.GrLivArea.values, sample_train.SalePrice.values, color ='b');
plt.title("Chart with Data Points");
plt.subplots(figsize = (15,8))
ax = plt.gca()
ax.scatter(sample_train.GrLivArea.values, sample_train.SalePrice.values, color ='b');
#ax = sns.regplot(sample_train.GrLivArea.values, sample_train.SalePrice.values)
ax.plot((sample_train.GrLivArea.values.min(),sample_train.GrLivArea.values.max()), (sample_train.SalePrice.values.mean(),sample_train.SalePrice.values.mean()), color = 'r');
plt.title("Chart with Average Line");
## Calculating Mean Squared Error(MSE)
sample_train['mean_sale_price'] = sample_train.SalePrice.mean()
sample_train['mse'] = np.square(sample_train.mean_sale_price - sample_train.SalePrice)
sample_train.mse.mean()
## getting mse
print("Mean Squared Error(MSE) for average line is : {}".format(sample_train.mse.mean()))
Introducing Linear Regression, one of the most basic and straightforward models. Many of us may have learned to show the relationship between two variables using something called "y equals mx plus b." Let's refresh our memory and call upon that equation.
m = slope of the regression line. It represents the relationship between X and y. In other words, it gives the weight: for each unit of x (horizontal space), how much y (vertical space) we have to cover. In machine learning, we call it the coefficient.
b = y-intercept.
x and y are the data points located on the x-axis and y-axis respectively.
And this is the equation for a simple linear regression: y = B0 + B1*x1 + e. Here,
y = dependent variable. This is what we are trying to estimate/solve/understand.
B0 = the y-intercept; it is a constant and represents the value of y when x is 0.
B1 = slope, weight, coefficient of x. This metric captures the relationship between y and x; in simple terms, it shows how much y changes when x increases by 1 unit.
x1 = independent variable (one in simple linear regression, several in multiple regression).
e = error or residual.
residual = yᵢ − ŷᵢ
Image("/Users/tuktuk/Desktop/Images/Image 5.png") # Image 5
Image("/Users/tuktuk/Desktop/Images/Image 6.png") #Image 6
## Calculating the beta coefficients by hand.
## mean of y.
y_bar = sample_train.SalePrice.mean()
## mean of x.
x_bar = sample_train.GrLivArea.mean()
## Std of y
std_y = sample_train.SalePrice.std()
## std of x
std_x = sample_train.GrLivArea.std()
## correlation of x and y
r_xy = sample_train.corr().loc['GrLivArea','SalePrice']
## finding beta_1
beta_1 = r_xy*(std_y/std_x)
## finding beta_0
beta_0 = y_bar - beta_1*x_bar
## getting y_hat, which is the predicted y values.
sample_train['Linear_Yhat'] = beta_0 + beta_1*sample_train['GrLivArea']
# create a figure
fig = plt.figure(figsize=(15,7))
# get the axis of that figure
ax = plt.gca()
# plot a scatter plot on it with our data
ax.scatter(sample_train.GrLivArea, sample_train.SalePrice, c='b')
ax.plot(sample_train['GrLivArea'], sample_train['Linear_Yhat'], color='r');
## getting mse
print("Mean Squared Error(MSE) for regression line is : {}".format(np.square(sample_train['SalePrice']
- sample_train['Linear_Yhat']).mean()))
from sklearn.metrics import mean_squared_error
mean_squared_error(sample_train['SalePrice'], sample_train.Linear_Yhat)
## Creating a customized chart with a specified figsize.
fig = plt.figure(constrained_layout=True, figsize=(15,5))
## creating a grid of 2 cols and 1 row.
import matplotlib.gridspec as gridspec
grid = gridspec.GridSpec(ncols=2, nrows=1, figure=fig)
ax1 = fig.add_subplot(grid[0, :1])
# plot it
ax1.scatter(x=sample_train['GrLivArea'], y=sample_train['SalePrice'], c='b')
ax1.plot(sample_train['GrLivArea'], sample_train['mean_sale_price'], color='k');
# iterate over predictions
for _, row in sample_train.iterrows():
    plt.plot((row['GrLivArea'], row['GrLivArea']), (row['SalePrice'], row['mean_sale_price']), 'r-')
ax2 = fig.add_subplot(grid[0, 1:])
# plot it
ax2.scatter(x=sample_train['GrLivArea'], y=sample_train['SalePrice'], c='b')
ax2.plot(sample_train['GrLivArea'], sample_train['Linear_Yhat'], color='k');
# iterate over predictions
for _, row in sample_train.iterrows():
    plt.plot((row['GrLivArea'], row['GrLivArea']), (row['SalePrice'], row['Linear_Yhat']), 'r-')
Now, we need to introduce a couple of evaluation metrics that will help us compare and contrast models. One of them is mean squared error (MSE), which we used above while comparing the two models. Some of the other metrics are...
Image("/Users/tuktuk/Desktop/Images/Image 7.png") # Image 7
Image("/Users/tuktuk/Desktop/ML/Screenshot 2020-09-16 at 1.06.20 PM.png") # Image 8
Image("/Users/tuktuk/Desktop/Images/Image 8.png") #Image 9
Image("/Users/tuktuk/Desktop/Images/Image 9.png") # Image 9
Image("/Users/tuktuk/Desktop/Images/Image 10.png") # Image 10
Image("/Users/tuktuk/Desktop/Images/Image 11.png") # Image 11
Image("/Users/tuktuk/Desktop/Images/Image 12.png") # image 12
Image("/Users/tuktuk/Desktop/Images/Image 13.png") # Image 13
Image("/Users/tuktuk/Desktop/Images/Image 14.png") # Image 14
Image("/Users/tuktuk/Desktop/Images/Image 15.png") # Image 15
## importing necessary models.
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
## Build the LinearRegression model; the old `normalize=True` argument has been
## removed from scikit-learn, so we standardize the features in a pipeline instead.
lin_reg = make_pipeline(StandardScaler(), LinearRegression(n_jobs=-1))
## fit on the training data.
lin_reg.fit(X_train, y_train)
## predict on the test data.
y_pred = lin_reg.predict(X_test)
## mean squared error (MSE) between predicted and actual values.
print('%.2f' % mean_squared_error(y_test, y_pred))
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, KFold
lin_reg = LinearRegression()
cv = KFold(shuffle=True, random_state=2, n_splits=10)
## scikit-learn returns the *negative* MAE for this scorer, so flip the sign when reporting.
scores = cross_val_score(lin_reg, X, y, cv=cv, scoring='neg_mean_absolute_error')
print('%.8f' % -scores.mean())
Regularization Models
What makes a regression model more robust is its ability to regularize. "Regularizing" refers to the model structurally preventing overfitting by imposing a penalty on the size of its coefficients.
There are three common types of regularization:
Ridge, Lasso and Elastic Net
These methods work by penalizing the magnitude of the feature coefficients while simultaneously minimizing the error between predicted and observed values, so fitting becomes a balance between the error term and the size of the coefficients. The only difference between Ridge and Lasso is how they penalize the coefficients: Ridge adds the sum of the squared coefficients (an L2 penalty), while Lasso adds the sum of their absolute values (an L1 penalty). Elastic Net is a combination of the two, adding both penalties at once. To get a better feel for this, let us review the least-squares loss function.
Image("/Users/tuktuk/Desktop/Images/Image 16.png") # Image 16
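As a rough NumPy sketch of the idea above (the function name `penalized_loss` is my own invention, not a scikit-learn API), the penalized loss is just the least-squares loss plus a coefficient penalty:

```python
import numpy as np

def penalized_loss(y, X, beta, alpha, kind="ridge"):
    """Least-squares loss plus a penalty on the coefficients.
    'ridge' adds alpha * sum(beta^2) (L2);
    'lasso' adds alpha * sum(|beta|) (L1);
    anything else mixes both, as Elastic Net does."""
    residual = y - X @ beta
    loss = np.sum(residual ** 2)
    if kind == "ridge":
        return loss + alpha * np.sum(beta ** 2)
    elif kind == "lasso":
        return loss + alpha * np.sum(np.abs(beta))
    return loss + alpha * (np.sum(beta ** 2) + np.sum(np.abs(beta)))
```

Larger `alpha` values weight the penalty term more heavily, shrinking the coefficients; `alpha=0` recovers plain least squares.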
One of the benefits of regularization is that it deals well with multicollinearity (high correlation between predictor variables), especially the Ridge method. Lasso handles multicollinearity more brutally: it penalizes correlated coefficients and can force some of them all the way to zero, effectively removing those features. This makes Lasso well suited for data with redundant variables.
Image("/Users/tuktuk/Desktop/Images/Image 17.png") # Image 17
## Importing Ridge.
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
## Trying a range of alpha values to explore which fits the model best; alpha must be non-negative.
alpha_ridge = [1e-15, 1e-10, 1e-8, 1e-5, 1e-4, 1e-3, 1e-2, 0.5, 1, 1.5, 2, 3, 4, 5, 10, 20, 30, 40]
temp_rss = {}
temp_mse = {}
for i in alpha_ridge:
    ## Assign each model; scaling in a pipeline replaces the removed `normalize=True` option.
    ridge = make_pipeline(StandardScaler(), Ridge(alpha=i))
    ## fit the model.
    ridge.fit(X_train, y_train)
    ## Predict the target values for X_test.
    y_pred = ridge.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    rss = sum((y_pred - y_test)**2)
    temp_mse[i] = mse
    temp_rss[i] = rss
for key, value in sorted(temp_mse.items(), key=lambda item: item[1]):
    print("%s: %s" % (key, value))
for key, value in sorted(temp_rss.items(), key=lambda item: item[1]):
    print("%s: %s" % (key, value))
Image("/Users/tuktuk/Desktop/Images/Image 18.png") # Image 18
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
temp_rss = {}
temp_mse = {}
for i in alpha_ridge:
    ## Assign each model; scaling in a pipeline replaces the removed `normalize=True` option.
    lasso_reg = make_pipeline(StandardScaler(), Lasso(alpha=i))
    ## fit the model.
    lasso_reg.fit(X_train, y_train)
    ## Predict the target values for X_test.
    y_pred = lasso_reg.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    rss = sum((y_pred - y_test)**2)
    temp_mse[i] = mse
    temp_rss[i] = rss
for key, value in sorted(temp_mse.items(), key=lambda item: item[1]):
    print("%s: %s" % (key, value))
for key, value in sorted(temp_rss.items(), key=lambda item: item[1]):
    print("%s: %s" % (key, value))
Image("/Users/tuktuk/Desktop/Images/Image 19.png") # Image 19
from sklearn.linear_model import ElasticNet
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
temp_rss = {}
temp_mse = {}
for i in alpha_ridge:
    ## Assign each model; scaling in a pipeline replaces the removed `normalize=True` option.
    elastic_reg = make_pipeline(StandardScaler(), ElasticNet(alpha=i))
    ## fit the model.
    elastic_reg.fit(X_train, y_train)
    ## Predict the target values for X_test.
    y_pred = elastic_reg.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    rss = sum((y_pred - y_test)**2)
    temp_mse[i] = mse
    temp_rss[i] = rss
for key, value in sorted(temp_mse.items(), key=lambda item: item[1]):
    print("%s: %s" % (key, value))
for key, value in sorted(temp_rss.items(), key=lambda item: item[1]):
    print("%s: %s" % (key, value))